@sanderegg sanderegg commented Jun 12, 2025

What do these changes do?

While testing the computational backend for issues, many non-critical error logs like the following were found:

log_level=ERROR | log_timestamp=2025-06-11 04:15:40,492 | log_source=distributed.client:_reconnect(1530) | log_uid=None | log_oec=None| log_trace_id=d15493584ca29e5b10e2c5590b8b5b55 | log_span_id=5dc3145a361ef4a9 | log_resource.service.name= | log_trace_sampled=True] | log_msg=Failed to reconnect to scheduler after 30.00 seconds, closing client

The Dask clients were acquired once, cached in the director-v2, and never closed, which led to that error message (harmless in itself, but confusing).
This PR reference-counts the running pipelines for a given user/wallet; when that count drops to 0, the client is properly closed.

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Engage milestone Jun 12, 2025
@sanderegg sanderegg self-assigned this Jun 12, 2025
@sanderegg sanderegg added the a:director-v2 issue related with the director-v2 service label Jun 12, 2025
codecov bot commented Jun 12, 2025

Codecov Report

Attention: Patch coverage is 94.73684% with 3 lines in your changes missing coverage. Please review.

Project coverage is 87.88%. Comparing base (533c02e) to head (0ec191d).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7880      +/-   ##
==========================================
+ Coverage   87.26%   87.88%   +0.61%     
==========================================
  Files        1592     1428     -164     
  Lines       60884    59218    -1666     
  Branches     1213      612     -601     
==========================================
- Hits        53131    52043    -1088     
+ Misses       7411     6975     -436     
+ Partials      342      200     -142     
Flag Coverage Δ
integrationtests 64.23% <98.18%> (-0.09%) ⬇️
unittests 86.18% <94.73%> (+0.68%) ⬆️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 72.31% <0.00%> (-0.08%) ⬇️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 85.16% <ø> (+0.05%) ⬆️
agent 96.29% <ø> (ø)
api_server 91.76% <ø> (ø)
autoscaling 96.03% <ø> (∅)
catalog 92.29% <ø> (∅)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar 91.79% <ø> (ø)
datcore_adapter 97.94% <ø> (ø)
director 76.73% <ø> (ø)
director_v2 91.05% <98.18%> (-0.02%) ⬇️
dynamic_scheduler 96.69% <ø> (∅)
dynamic_sidecar 90.09% <ø> (+1.76%) ⬆️
efs_guardian 89.65% <ø> (ø)
invitations 93.00% <ø> (ø)
payments 92.57% <ø> (ø)
resource_usage_tracker 89.09% <ø> (∅)
storage 87.71% <ø> (∅)
webclient ∅ <ø> (∅)
webserver 87.63% <ø> (-0.06%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
@sanderegg sanderegg added the release Preparation for pre-release/release label Jun 12, 2025
@sanderegg sanderegg force-pushed the properly-close-clusters-client branch from 537a1c8 to d2d799d Compare June 12, 2025 11:29
@sanderegg sanderegg marked this pull request as ready for review June 12, 2025 11:32
@sanderegg sanderegg requested review from GitHK and pcrespov as code owners June 12, 2025 11:32
Copilot AI left a comment

Pull Request Overview

This PR introduces reference counting for Dask clients to ensure they are closed when no longer in use, reducing misleading error logs and resource leaks.
Key changes:

  • Added ref parameter to DaskClientsPool.acquire and implemented _client_refs tracking along with a new release_client_ref method.
  • Updated all acquire call sites in the scheduler and tests to pass a unique reference and release it when the pipeline finishes.
  • Added new unit tests for verifying reference-counting behavior and wrapped logging around client creation in log_context.
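To make the reference-counting idea above concrete, here is a minimal sketch. It is not the actual `DaskClientsPool` code: `FakeClient`, `RefCountedClientsPool`, the cluster URL, and the exact signatures are assumptions made for illustration. Only the names `acquire`, `ref`, `_client_refs`, and `release_client_ref` come from this PR.

```python
import asyncio
from dataclasses import dataclass, field


class FakeClient:
    """Hypothetical stand-in for a Dask client wrapper; only tracks closing."""

    def __init__(self, cluster_url: str) -> None:
        self.cluster_url = cluster_url
        self.closed = False

    async def close(self) -> None:
        self.closed = True


@dataclass
class RefCountedClientsPool:
    """Sketch: one cached client per cluster, closed when no ref holds it."""

    _clients: dict[str, FakeClient] = field(default_factory=dict)
    _client_refs: dict[str, set[str]] = field(default_factory=dict)
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def acquire(self, cluster_url: str, *, ref: str) -> FakeClient:
        async with self._lock:
            client = self._clients.get(cluster_url)
            if client is None:
                # First user of this cluster: create and cache the client.
                client = self._clients[cluster_url] = FakeClient(cluster_url)
            # Remember who is using the client (e.g. one ref per pipeline run).
            self._client_refs.setdefault(cluster_url, set()).add(ref)
            return client

    async def release_client_ref(self, ref: str) -> None:
        async with self._lock:
            for cluster_url in list(self._client_refs):
                refs = self._client_refs[cluster_url]
                refs.discard(ref)
                if not refs:
                    # Last reference gone: close the client and drop it.
                    del self._client_refs[cluster_url]
                    await self._clients.pop(cluster_url).close()


async def demo() -> tuple[bool, bool]:
    pool = RefCountedClientsPool()
    client = await pool.acquire("tcp://scheduler:8786", ref="user1:run1")
    same = await pool.acquire("tcp://scheduler:8786", ref="user1:run2")
    assert client is same  # both runs share the cached client
    await pool.release_client_ref("user1:run1")
    still_open = not client.closed  # run2 still holds a reference
    await pool.release_client_ref("user1:run2")
    return still_open, client.closed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A set of refs is used here instead of a plain counter, so releasing the same ref twice cannot push the count below what is really in use. Whether the real pool does the same is not stated in this conversation.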

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file

| File | Description |
| --- | --- |
| services/director-v2/tests/unit/with_dbs/test_utils_dask.py | Updated acquire calls in tests to include ref argument |
| services/director-v2/tests/unit/test_modules_dask_clients_pool.py | Updated tests to pass distinct refs and added test_dask_clients_pool_reference_counting |
| services/director-v2/src/simcore_service_director_v2/modules/dask_clients_pool.py | Implemented _client_refs, added release_client_ref, and modified acquire to require ref |
| services/director-v2/src/simcore_service_director_v2/modules/dask_client.py | Replaced manual info logs with log_context around client creation |
| services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py | Passed run_id into acquire calls and added _release_resources to free clients |
| services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_base.py | Declared abstract _release_resources and invoked it on pipeline completion |
| packages/service-library/src/servicelib/rabbitmq/rpc_interfaces/clusters_keeper/clusters.py | Defined a constant RPC method name with TypeAdapter.validate_python |

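The abstract `_release_resources` hook mentioned for `_scheduler_base.py` could look roughly like the sketch below. The class names, the `run_pipeline` method, and the use of `try`/`finally` are assumptions for illustration; the summary only says the hook is declared abstract and invoked on pipeline completion.

```python
import asyncio
from abc import ABC, abstractmethod
from collections.abc import Awaitable, Callable


class BaseSchedulerSketch(ABC):
    """Hypothetical base scheduler: runs steps, then always releases resources."""

    async def run_pipeline(
        self, run_ref: str, steps: list[Callable[[], Awaitable[None]]]
    ) -> None:
        try:
            for step in steps:
                await step()
        finally:
            # Assumption: release even if a step raised, so refs never leak.
            await self._release_resources(run_ref)

    @abstractmethod
    async def _release_resources(self, run_ref: str) -> None:
        """Subclasses free whatever they acquired for this run."""


class RecordingScheduler(BaseSchedulerSketch):
    """Toy subclass that records releases instead of closing real clients."""

    def __init__(self) -> None:
        self.released: list[str] = []

    async def _release_resources(self, run_ref: str) -> None:
        self.released.append(run_ref)


async def demo_release() -> list[str]:
    scheduler = RecordingScheduler()

    async def step() -> None:
        await asyncio.sleep(0)

    await scheduler.run_pipeline("1:42", [step, step])
    return scheduler.released


if __name__ == "__main__":
    print(asyncio.run(demo_release()))
```

In a Dask-backed subclass, `_release_resources` would presumably call the pool's `release_client_ref` with the run's reference.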
Comments suppressed due to low confidence (2)

services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py:59

  • [nitpick] This constant is a template string, not a literal ref. Consider renaming it to _DASK_CLIENT_RUN_REF_TEMPLATE to clarify its role and avoid confusion.
_DASK_CLIENT_RUN_REF: Final[str] = "{user_id}:{comp_run.run_id}"
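
As a sketch of why the nitpick calls this constant a template: `str.format` resolves attribute lookups such as `{comp_run.run_id}`, so the string has to be formatted before it can serve as a ref. `FakeCompRun` and `make_run_ref` below are hypothetical names for illustration, not part of the PR.

```python
from dataclasses import dataclass
from typing import Final

# The constant quoted in the review comment above.
_DASK_CLIENT_RUN_REF: Final[str] = "{user_id}:{comp_run.run_id}"


@dataclass(frozen=True)
class FakeCompRun:
    """Hypothetical stand-in for the computational run model."""

    run_id: int


def make_run_ref(user_id: int, comp_run: FakeCompRun) -> str:
    # str.format follows the attribute access inside the replacement field.
    return _DASK_CLIENT_RUN_REF.format(user_id=user_id, comp_run=comp_run)


if __name__ == "__main__":
    print(make_run_ref(42, FakeCompRun(run_id=7)))  # prints 42:7
```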

services/director-v2/src/simcore_service_director_v2/modules/dask_clients_pool.py:121

  • The indentation of this with log_context block is inconsistent relative to the surrounding code. Align it with the async with block so that the log context cleanly wraps the client acquisition logic.
with log_context(
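
The `log_context` call above is truncated in the review, and its real signature in the service library is not shown in this conversation. As a rough idea of the pattern only, the sketch below uses a hypothetical `log_context_sketch` helper that logs when a block starts and when it finishes or fails.

```python
import logging
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def log_context_sketch(
    logger: logging.Logger, level: int, msg: str
) -> Iterator[None]:
    """Hypothetical helper: log around a block instead of manual info logs."""
    logger.log(level, "Starting %s ...", msg)
    try:
        yield
    except Exception:
        logger.log(level, "Raised while %s", msg)
        raise
    else:
        logger.log(level, "Finished %s", msg)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    _logger = logging.getLogger("sketch")
    with log_context_sketch(_logger, logging.INFO, "creating dask client"):
        pass  # the client creation would happen here
```

Wrapping the whole acquisition in one context keeps the start and end messages paired, which is the consistency the indentation comment is after.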

@sanderegg sanderegg force-pushed the properly-close-clusters-client branch from 94eb7db to e5b3e39 Compare June 12, 2025 15:31
@sanderegg (Member, Author) commented

@mergify queue

mergify bot commented Jun 12, 2025

queue

The pull request has been merged automatically at ddc3e74

@sanderegg sanderegg added the 🤖-automerge marks PR as ready to be merged for Mergify label Jun 12, 2025
@sanderegg sanderegg force-pushed the properly-close-clusters-client branch from e5b3e39 to 0ec191d Compare June 13, 2025 05:33
@mergify mergify bot merged commit ddc3e74 into ITISFoundation:master Jun 13, 2025
95 of 96 checks passed
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Jun 20, 2025
92 tasks
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Aug 5, 2025
88 tasks